How to Become a Data Engineer
- Metadata:
- #article #data-science #career
- Source: How to become a data engineer
- A data engineer is responsible for building and maintaining the processes that deliver, store, and process data
- Forms the foundation of the hierarchy of data science needs: collect, move/store, explore/transform, aggregate/label, learn/optimize
- Modern skillsets should include: intermediate knowledge of SQL and Python, experience with cloud providers (AWS, Azure, GCP), knowledge of Java and Scala, and an understanding of SQL/NoSQL databases (modelling, warehousing, performance optimization)
- Expanding this skillset for FAANG companies: experience with big data tools (Hadoop, Kafka, Spark), knowledge of algorithms and data structures, an understanding of distributed systems, and familiarity with BI tools (Tableau, QlikView, Looker, Superset)
- Algorithms and Data Structures [[Data Science Learning]]
- [[SQL]]
- Many databases still use SQL, as do many SQL-based query engines ([[Presto]], [[Apache Hive]], [[Impala]])
- https://mode.com/sql-tutorial/introduction-to-sql/
- https://modern-sql.com/
- https://use-the-index-luke.com/
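- A minimal sketch of the kind of SQL worth practising (filtering, aggregation, and checking whether a query can use an index), run here through Python's built-in `sqlite3` module. The `events` table, its columns, the index name, and the sample rows are all hypothetical, invented for illustration only:

```python
import sqlite3

# In-memory database; the schema and rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, event_type TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "purchase", 10.0),
        (1, "purchase", 5.0),
        (2, "purchase", 7.5),
        (2, "refund", -7.5),
    ],
)

# An index on the column used in the WHERE clause (hypothetical name).
conn.execute("CREATE INDEX idx_events_type ON events (event_type)")

# Filter, aggregate per user, and order the result deterministically.
totals = conn.execute(
    """
    SELECT user_id, SUM(amount) AS total
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(totals)  # [(1, 15.0), (2, 7.5)]

# Ask SQLite how it plans to execute a query; the exact wording of the
# plan differs between SQLite versions, so it is only printed here.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE event_type = 'purchase'"
).fetchall()
for row in plan:
    print(row)
```

- The same `SELECT ... WHERE ... GROUP BY` shape carries over to most SQL databases and SQL-based engines, though plan inspection syntax varies by system.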
- Programming
- Since many big data systems are written in Java or Scala, it is vital to learn your way around these two languages
- Scala: [[Apache Kafka]], [[Apache Spark]]
- Java: [[Hadoop HDFS]], Cassandra, HBase, [[Apache Hive]], [[Presto]]
- https://twitter.github.io/scala_school/
- Big Data Tools
- A rapidly changing landscape but the most popular tools are
- [[Apache Kafka]] for message queue/event bus/event streaming
- [[Apache Spark]] for large-scale data processing
- Apache Hadoop is a big data framework whose ecosystem includes [[Hadoop HDFS]], [[Apache Hive]], and HBase for moving and storing data
- [[Apache Druid]] is a real-time analytics database
- https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
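- To make the "message queue/event bus/event streaming" idea concrete, here is a toy in-memory model of an append-only log with independent consumer offsets. It is only a sketch of the concept: the `Log` and `Consumer` classes are hypothetical and are not the Kafka API, and there is no persistence, partitioning, or networking:

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Append-only sequence of records, each addressed by its offset."""

    records: list = field(default_factory=list)

    def append(self, record) -> int:
        """Add a record at the end and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records records starting at offset."""
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Tracks its own position in the log; reading never removes records."""

    log: Log
    offset: int = 0

    def poll(self, max_records: int = 10) -> list:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for event in ["signup", "login", "purchase"]:
    log.append(event)

fast = Consumer(log)
slow = Consumer(log)

first_batch = fast.poll()               # reads everything written so far
slow_batch = slow.poll(max_records=1)   # reads only the first record
log.append("logout")
second_batch = fast.poll()              # picks up only the new record
```

- Because each consumer keeps its own offset, several readers can process the same stream at different speeds without interfering with one another.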
- Data Pipelines
- [[Apache Airflow]], Spotify Luigi, Prefect, Dagster
- [[Introduction to Apache Airflow]]
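- The tools above all organise work as tasks with dependencies between them. A minimal sketch of that idea using only the Python standard library (`graphlib`, Python 3.9+) follows. It does not use the API of Airflow, Luigi, Prefect, or Dagster; the task names, functions, and the `run` helper are hypothetical:

```python
from graphlib import TopologicalSorter


def extract():
    # Stand-in for reading from a source system.
    return [3, 1, 2]


def transform(rows):
    # Stand-in for cleaning and reshaping the data.
    return sorted(r * 10 for r in rows)


def load(rows):
    # Stand-in for writing to a warehouse.
    return f"loaded {len(rows)} rows"


# Each task maps to (function, list of upstream task names).
tasks = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run(tasks):
    """Run every task after its upstream tasks, passing their results in."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = tasks[name]
        results[name] = func(*(results[d] for d in deps))
    return results


results = run(tasks)
print(results["load"])  # loaded 3 rows
```

- Real orchestrators add scheduling, retries, logging, and distributed execution on top of this dependency-ordering core.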